Analysis of Households with no Internet Access in the United States

Yukun Zheng, Yuxin Wen and Ziyue Su

INTRODUCTION

The internet is a necessity in daily life and facilitates the world in many ways, from communicating and sharing information with people to online banking and shopping. Many people even view the internet as a utility like water, electricity and gas. However, do you know how many households in each state of the United States do not have internet access, who these people are, and why they do not have it?

With these questions in mind, we will analyze households with no internet access in the United States and deduce the potential causes. According to Kaggle, the U.S. Census Bureau began asking about internet use in the American Community Survey in 2013; the dataset contains data for counties with populations over 65,000.

Throughout this tutorial, we will collect, process and visualize the data in this dataset, as well as apply machine learning techniques to it. We aim to uncover relationships between lack of internet access and potential factors such as age, location and income. We also hope to inform the public that many households still do not have internet access today.

REQUIRED TOOLS

We will need the following libraries for this project:

  1. pandas
  2. matplotlib.pyplot
  3. numpy
  4. sqlite3
  5. seaborn
  6. sklearn
  7. statsmodels
  8. scipy
  9. folium
  10. pydotplus (with graphviz, for visualizing trees)

We highly recommend referring to the following resources for more information about installing pandas and about Python 3 in general:

  1. https://pandas.pydata.org/pandas-docs/stable/install.html
  2. https://docs.python.org/3/
In [1]:
!pip install folium
!pip install graphviz
!conda install graphviz -y
!pip install pydotplus
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import datasets
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import folium
from folium.plugins import HeatMap
from numpy.polynomial.polynomial import polyfit

1. DATA COLLECTION:

This is the data collection stage of the data life cycle. Throughout this step, we primarily focus on collecting data from valuable sources.

Here we are reading the local csv file “kaggle_internet.csv” with pandas and loading the data into a dataframe.

Kaggle has arranged the data into different columns as follows:

  1. county
  2. state
  3. geographic identifier for the county
  4. longitude of a point inside the county
  5. latitude of the point
  6. total population
  7. population of different races
  8. population of different education levels
  9. population living below poverty line
  10. population of median age
  11. gini index
  12. median household income
  13. median percent of income spent on rent
  14. percent of household without internet access

We will use pandas.DataFrame to hold the data. This data structure, a two-dimensional table, is convenient because its columns can hold objects of different types, and pandas provides many powerful functions to process it. More info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html.
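As a quick illustration (with made-up values, not the Kaggle data), a single DataFrame can hold strings, floats and integers side by side:

```python
import pandas as pd

# A toy frame showing how one DataFrame mixes column types --
# the property we rely on when loading the Kaggle csv.
df = pd.DataFrame({
    "county": ["A", "B"],          # strings
    "gini_index": [0.40, 0.47],    # floats
    "P_total": [100000, 250000],   # integers
})
print(df.dtypes)
```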

We will manipulate and analyze the data later throughout this tutorial. The head of the original dataset is printed below:

In [2]:
# SCRAPE DATA FROM THE LOCAL CSV FILE
data = pd.read_csv("kaggle_internet.csv")
data.index = data.index + 1
data.head()
Out[2]:
county state GEOID lon lat P_total P_white P_black P_asian P_native ... P_some_high_school P_high_school_equivalent P_some_college P_bachelor_and_above P_below_poverty median_age gini_index median_household_income median_rent_per_income percent_no_internet
1 Anchorage Municipality AK 05000US02020 -149.274354 61.177549 298192 184841.0 16102.0 27142.0 23916.0 ... 8196.0 44804.0 66162.0 70713.0 18302 33.0 0.4018 85634 28.0 6.593887
2 Fairbanks North Star Borough AK 05000US02090 -146.599867 64.690832 100605 75501.0 4385.0 3875.0 7427.0 ... 1527.0 14725.0 24570.0 19257.0 9580 30.6 0.3756 77328 25.6 12.102458
3 Matanuska-Susitna Borough AK 05000US02170 -149.407974 62.182173 104365 86314.0 1019.0 1083.0 5455.0 ... 2755.0 21071.0 28472.0 12841.0 9893 34.2 0.4351 69332 29.6 11.156575
4 Baldwin County AL 05000US01003 -87.746067 30.659218 208563 180484.0 18821.0 914.0 1383.0 ... 10506.0 41822.0 46790.0 43547.0 23375 42.4 0.4498 56732 29.3 17.868167
5 Calhoun County AL 05000US01015 -85.822513 33.771706 114611 NaN NaN NaN NaN ... 8853.0 24761.0 26625.0 12909.0 18193 39.1 0.4692 41687 24.8 23.464932

5 rows × 23 columns

Below we will take you through the process of data tidying, a critical step in structuring the dataset and facilitating our later analysis.

2. DATA PROCESSING:

This is the data processing stage of the data life cycle. During this phase, we essentially attempt to “fix up” the organization of our data so that it will be easier and more readable to analyze.

You can get more information regarding data cleaning and tidying at: https://realpython.com/python-data-cleaning-numpy-pandas/.

Here we are basically reformatting our dataset in correspondence with the formatting provided by Kaggle. We do so by including the percentages of each population in new columns “percent_something” and then dropping the original population columns.

Performing this step in the data life cycle is crucial in ensuring facilitated manipulation and flexibility for future analysis.

In [3]:
# ADD THE NEW COLUMN "PERCENT_SOMETHING" TO GET THE PERCENTAGE OF POPULATION
# ALSO DROP THE ORIGINAL COLUMN
temp = data.columns.values[6:18]
for i in temp:
    temp_str = i[2:]
    
    # GET THE PERCENTAGE OF EACH GROUP OF PEOPLE
    data["percent_" + temp_str] = 100 * data[i] / data["P_total"]
    
    # DROP THE ORIGINAL COLUMN
    data.drop(i, axis=1, inplace=True)

data.head()
Out[3]:
county state GEOID lon lat P_total median_age gini_index median_household_income median_rent_per_income ... percent_asian percent_native percent_hawaiian percent_others percent_below_middle_school percent_some_high_school percent_high_school_equivalent percent_some_college percent_bachelor_and_above percent_below_poverty
1 Anchorage Municipality AK 05000US02020 -149.274354 61.177549 298192 33.0 0.4018 85634 28.0 ... 9.102189 8.020336 2.571833 2.661037 0.749182 2.748565 15.025219 22.187718 23.713916 6.137656
2 Fairbanks North Star Borough AK 05000US02090 -146.599867 64.690832 100605 30.6 0.3756 77328 25.6 ... 3.851697 7.382337 0.499975 2.342826 0.918443 1.517817 14.636449 24.422245 19.141196 9.522390
3 Matanuska-Susitna Borough AK 05000US02170 -149.407974 62.182173 104365 34.2 0.4351 69332 29.6 ... 1.037704 5.226848 0.135103 0.311407 0.322905 2.639774 20.189719 27.281177 12.303933 9.479232
4 Baldwin County AL 05000US01003 -87.746067 30.659218 208563 42.4 0.4498 56732 29.3 ... 0.438237 0.663109 0.000000 0.704344 1.555885 5.037327 20.052454 22.434468 20.879542 11.207645
5 Calhoun County AL 05000US01015 -85.822513 33.771706 114611 39.1 0.4692 41687 24.8 ... NaN NaN NaN NaN 2.142028 7.724389 21.604384 23.230754 11.263317 15.873694

5 rows × 23 columns

As we can see, there are missing data in the dataset. We regard the type of missingness as missing at random (MAR), which means the probability of missing data is dependent on the observed data but not the unobserved data. We infer that it is missing at random since we believe some people would refuse to answer questions with respect to their ages, household incomes and education levels.

We will tidy the missing data by mean substitution, where we impute the average of the observed cases for all missing values of a variable. More specifically, we replace each NaN value with the mean of its column, which is equivalent to calling df.fillna with the column means.

For more information about handling the missing data you should refer to: https://machinelearningmastery.com/handle-missing-data-python/.
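The same mean substitution can be sketched in one line with fillna; the toy columns below are hypothetical, not the Kaggle ones:

```python
import pandas as pd
import numpy as np

# A minimal sketch of mean substitution: each NaN is replaced
# by the mean of its own column.
df = pd.DataFrame({"median_age": [30.0, np.nan, 40.0],
                   "gini_index": [0.40, 0.45, np.nan]})
filled = df.fillna(df.mean())
print(filled)
```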

The head of the resulting tidy dataset is printed below:

In [4]:
# REPLACE THE NAN DATA WITH THE MEAN OF THE COLUMN (MEAN SUBSTITUTION)
# BY USING DF.FILLNA BY MEAN TO FILL THE NAN DATA
temp_mean = data.mean()

for j in data.columns.values[6:18]:
    temp = []
    for i in data[j]:
        # NaN is the only value that is not equal to itself
        if i != i:
            temp.append(temp_mean[j])
        else:
            temp.append(i)
    
    data[j] = temp
        
data.head()
Out[4]:
county state GEOID lon lat P_total median_age gini_index median_household_income median_rent_per_income ... percent_asian percent_native percent_hawaiian percent_others percent_below_middle_school percent_some_high_school percent_high_school_equivalent percent_some_college percent_bachelor_and_above percent_below_poverty
1 Anchorage Municipality AK 05000US02020 -149.274354 61.177549 298192 33.0 0.4018 85634 28.0 ... 9.102189 8.020336 2.571833 2.661037 0.749182 2.748565 15.025219 22.187718 23.713916 6.137656
2 Fairbanks North Star Borough AK 05000US02090 -146.599867 64.690832 100605 30.6 0.3756 77328 25.6 ... 3.851697 7.382337 0.499975 2.342826 0.918443 1.517817 14.636449 24.422245 19.141196 9.522390
3 Matanuska-Susitna Borough AK 05000US02170 -149.407974 62.182173 104365 34.2 0.4351 69332 29.6 ... 1.037704 5.226848 0.135103 0.311407 0.322905 2.639774 20.189719 27.281177 12.303933 9.479232
4 Baldwin County AL 05000US01003 -87.746067 30.659218 208563 42.4 0.4498 56732 29.3 ... 0.438237 0.663109 0.000000 0.704344 1.555885 5.037327 20.052454 22.434468 20.879542 11.207645
5 Calhoun County AL 05000US01015 -85.822513 33.771706 114611 39.1 0.4692 41687 24.8 ... 3.903036 1.157285 0.197669 3.505234 2.142028 7.724389 21.604384 23.230754 11.263317 15.873694

5 rows × 23 columns

3. EXPLORATORY ANALYSIS AND DATA VISUALIZATION:

This is the exploratory analysis and visualization stage of the data life cycle where we attempt to plot our data in order to observe potential trends. We will also perform statistical analysis in this step in order to obtain better supporting evidence for trends that may be discovered.

Firstly, the map below pinpoints the locations of counties at different Gini index levels. According to Wikipedia, the Gini index is a measure of the income or wealth distribution of a place’s residents and is the most commonly used measure of inequality. There are 820 counties in our dataset; blue dots represent counties with a Gini index less than or equal to 0.4, red dots represent counties with a Gini index between 0.4 and 0.5, and black dots represent counties with a Gini index greater than 0.5.
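As a quick aside on what the Gini index measures, here is a minimal sketch using the textbook mean-absolute-difference formula (toy incomes; not the Census Bureau's exact estimator):

```python
import numpy as np

def gini(incomes):
    """Gini coefficient via relative mean absolute difference:
    sum over all pairs of |x_i - x_j| / (2 * n^2 * mean)."""
    x = np.asarray(incomes, dtype=float)
    n = len(x)
    diffs = np.abs(x[:, None] - x[None, :]).sum()
    return diffs / (2 * n * n * x.mean())

print(gini([50000, 50000, 50000]))   # perfect equality -> 0.0
print(gini([0, 0, 0, 100000]))       # one household holds everything -> 0.75
```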

In [5]:
map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
map_osm
for i, r in data.iterrows():
    color = "black"
    if r["gini_index"] <= 0.4:
        color = "blue"
    elif r["gini_index"] <= 0.5:
        color = "red"
        
    folium.Circle(radius = 10, location = [r["lat"], r["lon"]], popup = "", color = color, fill = True).add_to(map_osm)
        
map_osm
Out[5]:

Below shows the locations of counties with the highest and lowest Gini index. The top 30 highest counties are labeled as red and the top 30 lowest are labeled as green. As we can see from the map, coastal areas have higher Gini index compared to other areas in the United States, which indicates that coastal areas have a larger gap between the wealthy and poor. You will know the Gini index of each county by simply moving your mouse over its tag!

In [6]:
#map the top 30 highest and lowest Gini index counties and observe their location
#put your mouse over the tag to see its gini index
data.sort_values("gini_index",inplace=True,ascending=False)
data.head()

map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
map_osm
map1=map_osm
index=1
for i,r in data.iterrows():
    a=r['lat']
    b=r['lon']
    tooltip1=str(r['gini_index'])
    if index>=1 and index<=30:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='red'),
                          tooltip=tooltip1).add_to(map1)
    if index>=790 and index <=820:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='green'),
                          tooltip=tooltip1).add_to(map1)
    index=index+1
map1
Out[6]:

On the other hand, the map below shows the locations of counties with the highest and lowest median ages. The top 30 highest counties are labeled red and the top 30 lowest are labeled green. We can see that coastal areas (especially Florida) have the highest median ages. You can see the median age of each county by simply moving your mouse over its tag!

In [7]:
#pay attention to bottom right
data.sort_values("median_age",inplace=True,ascending=False)
data.head()
map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
map_osm
map1=map_osm
index=1
for i,r in data.iterrows():
    a=r['lat']
    b=r['lon']
    tooltip1=str(r['median_age'])
    if index>=1 and index<=30:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='red'),
                          tooltip=tooltip1).add_to(map1)
    if index>=790 and index <=820:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='green'),
                          tooltip=tooltip1).add_to(map1)
    index=index+1
map1
Out[7]:

Below displays the locations of counties with the highest and lowest percentages of households with no internet access. The top 30 highest counties are labeled red and the top 30 lowest are labeled green. The map suggests that the north of the United States has a lower percent_no_internet than the south, and we see no correlation between median_age and percent_no_internet. You can see the percentage for each county by simply moving your mouse over its tag!

In [8]:
data.sort_values("percent_no_internet",inplace=True,ascending=False)
data.head()

map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
map_osm
map1=map_osm
index=1
for i,r in data.iterrows():
    a=r['lat']
    b=r['lon']
    tooltip1=str(r['percent_no_internet'])
    if index>=1 and index<=30:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='red'),
                          tooltip=tooltip1).add_to(map1)
    if index>=790 and index <=820:
        folium.Marker([a,b], popup='', icon=folium.Icon(color='green'),
                          tooltip=tooltip1).add_to(map1)
    index=index+1
map1
Out[8]:

The heatmap below shows the graphical representation of households that do not have internet connection. We can see that a great part of these households are from the east.

In [9]:
#simple heatmap
#zoom in
map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
map_osm
map1=map_osm
x=data[['lat','lon']].values.tolist()
# note: each county is weighted equally here; HeatMap also accepts
# [lat, lon, weight] triples if we want to weight by percent_no_internet
HeatMap(x).add_to(map1)
map1
Out[9]:

4. MACHINE LEARNING AND ANALYSIS:

During this phase of the data life cycle, we attempt to perform various modeling techniques such as linear regression to obtain a predictive model of our data. This allows us to predict values for data outside of the scope of our data.

Please visit http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/ for more information on machine learning techniques.

Here we will use numpy.polynomial.polynomial.polyfit to investigate the relationship between income inequality and households lacking internet access.

The first plot shows the correlation across the entire nation, while the rest show the correlations within each state. The majority of the graphs below indicate a positive correlation between gini_index and percent_no_internet. So it is highly possible that areas with larger disparities between the rich and poor have more households that lack internet connection.

In [10]:
# the plot of gini_index vs. percent_no_internet
plt.plot(data["percent_no_internet"], data["gini_index"], "ro")
b, m = polyfit(data["percent_no_internet"], data["gini_index"],1)
plt.plot(data["percent_no_internet"],b+m*data["percent_no_internet"],'-')
plt.title("gini_index vs. percent_no_internet")
plt.ylabel('gini_index')
plt.xlabel('percent_no_internet')
plt.show()
In [11]:
for i in data["state"].unique():
    temp = data.loc[data["state"] == i]

    regression = linear_model.LinearRegression()

    # reshape the two series into column vectors for sklearn
    gini = np.transpose([temp['gini_index'].tolist()])
    no_internet = np.transpose([temp['percent_no_internet'].tolist()])

    regression = regression.fit(gini, no_internet)
    plt.plot(gini, regression.predict(gini))
    plt.scatter(temp["gini_index"], temp["percent_no_internet"], s=np.pi*6, alpha=0.5)
    plt.title(i)
    plt.ylabel('percent_no_internet')
    plt.xlabel('gini_index')
    plt.show()
    print('Coefficient:', regression.coef_[0][0])
Coefficient: 382.777491604
Coefficient: -13.8724375234
Coefficient: 13.6476748597
Coefficient: 88.1010060635
Coefficient: 93.9535099791
Coefficient: 66.1803087262
Coefficient: 18.127031613
Coefficient: 59.2427372518
Coefficient: 12.7113549328
Coefficient: 3.14197461575
Coefficient: -27.3070478521
Coefficient: 65.7172581066
Coefficient: 58.8065049515
Coefficient: 12.3277516973
Coefficient: 9.67768207381
Coefficient: 70.6139569207
Coefficient: 28.5169520921
Coefficient: 57.0017778822
Coefficient: 27.0392365902
Coefficient: -139.449936453
Coefficient: -11.9718224792
Coefficient: 8.17249130225
Coefficient: 39.4129850273
Coefficient: -5.44544370471
Coefficient: 52.2006736772
Coefficient: 10.8606946205
Coefficient: 100.668188608
Coefficient: -36.5680916496
Coefficient: -0.808008904589
Coefficient: 35.8419402753
Coefficient: 63.7745839594
Coefficient: -13.3083555065
Coefficient: -24.4346920587
Coefficient: -39.9861376168
Coefficient: 7.81955538538
Coefficient: -38.2130017695
Coefficient: -13.9071234141
Coefficient: 37.9297599559
Coefficient: 42.8754634913
Coefficient: 0.0
Coefficient: 260.741411407
Coefficient: -140.871395497
Coefficient: -23.4714625959
Coefficient: 44.547562955
Coefficient: 37.1246234274
Coefficient: 137.865953936
Coefficient: 33.4019049476
Coefficient: -9.12123648978
Coefficient: 37.3879505863
Coefficient: 8.81278813702
Coefficient: 0.0

This plot reveals a positive correlation between gini_index and percent_below_poverty, which implies that areas with larger gaps between the rich and poor have higher rates of households living below poverty.

We already know that percent_no_internet correlates with gini_index positively, and since percent_below_poverty also correlates with gini_index positively, it is highly likely that poverty is one of the many factors that leads to households not having internet access.

In [12]:
plt.plot(data["gini_index"], data["percent_below_poverty"], "ro")
b, m = polyfit(data["gini_index"], data["percent_below_poverty"], 1)
plt.plot(data["gini_index"], b + m * data["gini_index"], '-')
plt.title("gini_index vs. percent_below_poverty")
plt.ylabel('percent_below_poverty')
plt.xlabel('gini_index')
plt.show()

Here we process the data so we can use it to perform machine learning later. In this step, we will handle the remaining missing data by dropping rows with NaN values, and will investigate three potential factors: age, location and income.

In [13]:
#drop nan data, rows number: 820->605
data.dropna(axis = 0, inplace = True)

#data.apply(LabelEncoder().fit_transform)
# LabelEncoder maps each distinct value to an integer rank,
# so these continuous columns become ordinal encodings
data['lon']=LabelEncoder().fit_transform(data['lon'])
data['lat']=LabelEncoder().fit_transform(data['lat'])
data['P_total']=LabelEncoder().fit_transform(data['P_total'])
data['median_age']=LabelEncoder().fit_transform(data['median_age'])
data['median_household_income']=LabelEncoder().fit_transform(data['median_household_income'])
data['percent_no_internet']=data['percent_no_internet']*100
data['percent_no_internet']=LabelEncoder().fit_transform(data['percent_no_internet'])
X = data[ ["lon", "lat", "P_total", 
           "median_age", "median_household_income" ]].values
y = data["percent_no_internet"].values

y
Out[13]:
array([817, 816, 815, 814, 813, 812, 811, 810, 809, 808, 807, 806, 805,
       ...
        12,  11,  10,   9,   8,   7,   6,   5,   4,   3,   2,   1,   0])

Here we use the r2 score and the explained variance score to evaluate the accuracy of our models; the highest possible value of each is 1. The results printed below are as follows:

  1. the r2 score for the decision tree is 0.917400869504
  2. the explained variance score for the decision tree is 0.921687095198
  3. the r2 score for the random forest is 0.884312289487
  4. the explained variance score for the random forest is 0.8899745882

The four numbers are all very close to 1, which indicates that our models fit the data well.
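For reference, both metrics can be computed by hand on toy numbers (hypothetical values, just to show the formulas behind sklearn's r2_score and explained_variance_score):

```python
import numpy as np

# r2 = 1 - SS_res / SS_tot
# explained variance = 1 - Var(residuals) / Var(y)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y_true - y_pred) ** 2)           # 0.10
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 5.0
r2 = 1 - ss_res / ss_tot
ev = 1 - np.var(y_true - y_pred) / np.var(y_true)
print(r2, ev)  # both 0.98 here; they differ when residuals have nonzero mean
```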

In [14]:
kf=KFold(n_splits=10)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    m,n=X_test.shape
    y_train, y_test = y[train_index], y[test_index]
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(X_train, y_train)
    
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)

#use r2 score to evaluate the accuracy of the models
#best possible r2 score is 1.0
#note: clf and rf come from the final fold only, and predicting on the
#full X includes rows they were trained on, so these scores are optimistic

#Also use explained_variance_score 
y_prediction1=clf.predict(X)
print(r2_score(y, y_prediction1))
print(explained_variance_score(y,y_prediction1))

y_prediction2=rf.predict(X)
print(r2_score(y, y_prediction2))
print(explained_variance_score(y,y_prediction2))
0.917400869504
0.921687095198
0.884312289487
0.8899745882
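One caveat about the scores above: predicting on the full X evaluates the models partly on rows they saw during training. A sketch of held-out scoring with cross_validate, on synthetic data standing in for our X and y:

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 200 rows, 5 features, a mostly linear target.
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 5)
y_demo = X_demo @ np.array([3.0, -2.0, 1.0, 0.0, 0.5]) + 0.05 * rng.randn(200)

# Each fold's r2 is computed only on rows held out from training.
cv = cross_validate(DecisionTreeRegressor(random_state=0),
                    X_demo, y_demo, cv=10, scoring="r2")
print(cv["test_score"].mean())  # held-out r2, usually lower than in-sample
```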

Below represents a visualization of the decision tree:

In [15]:
#clf tree viz
#took a really long time to run
from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dd = StringIO()
export_graphviz(clf, out_file=dd, filled=True, rounded=True, 
                feature_names=["lon", "lat", "P_total", 
           "median_age", "median_household_income"])
viz1 = pydotplus.graph_from_dot_data(dd.getvalue())  
Image(viz1.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.673262 to fit

Out[15]:

A random forest is an ensemble of many trees, so it is rather complicated to visualize in full. But we can still get a general idea by visualizing the first tree in our forest:

In [16]:
dd = StringIO()
export_graphviz(rf.estimators_[0], out_file=dd, filled=True, rounded=True, 
                feature_names=["lon", "lat", "P_total", "median_age", "median_household_income"])
viz2 = pydotplus.graph_from_dot_data(dd.getvalue())  
Image(viz2.create_png())
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.862698 to fit

Out[16]:

5. INSIGHT AND POLICY DECISIONS:

This is the phase of the life cycle where we attempt to utilize our data analysis to draw conclusions. Having observed the trends through visualization, machine learning and statistical analysis, we can deduce the potential causes.

Based on our observations throughout our analysis and modeling, we can suggest that:

  1. Over a quarter of households in many counties are still lacking internet connection.
  2. The north of the United States has lower percent_no_internet compared to the south.
  3. A great part of the households with no internet are from the east.
  4. Poverty is one of the many factors that lead to households not having internet connection.

Overall, we want to inform the public that there are still many households today that do not have internet access. The government should endeavor to provide internet access for low-income families, so that they too can enjoy the convenience the internet brings.

Lastly, if we conducted further analysis, we would expand our dataset and investigate other factors such as the education levels of households.